CUDA: enable FA for FP32 KV cache #16546

JohannesGaessler · 2025-10-12T21:22:59Z

Adds CUDA FlashAttention support for an FP32 KV cache. The FP32 data is converted to FP16 before it is passed to the kernel, which is inefficient but simple.
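
To make the cost of that conversion concrete, here is a minimal CPU-side sketch of an FP32 to FP16 conversion with round-to-nearest-even. It is not the code from this PR (which runs on the GPU); the names `f32_to_f16_bits` and `convert_row_f32_to_f16` are hypothetical and only illustrate what narrowing the KV data to FP16 does to the values.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the PR): encode one FP32 value as IEEE 754
// binary16 bits, rounding to nearest even.
static uint16_t f32_to_f16_bits(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t mag  = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) {   // NaN -> quiet NaN
        return (uint16_t) (sign | 0x7E00u);
    }
    if (mag >= 0x47800000u) {  // |f| >= 65536 (or infinity) -> infinity
        return (uint16_t) (sign | 0x7C00u);
    }
    if (mag < 0x38800000u) {   // |f| < 2^-14 -> FP16 subnormal or zero
        float a;
        std::memcpy(&a, &mag, sizeof(a));
        // FP16 subnormals are multiples of 2^-24. Scaling by 2^24 is exact and
        // lrintf rounds to nearest even under the default rounding mode.
        return (uint16_t) (sign | (uint32_t) std::lrintf(a * 16777216.0f));
    }

    // Normal range: rebias the exponent (127 -> 15), keep the top 10 mantissa bits.
    const uint32_t exp  = (mag >> 23) - 112u;
    const uint32_t mant = mag & 0x007FFFFFu;
    uint32_t h = (exp << 10) | (mant >> 13);

    // Round to nearest even on the 13 discarded bits. A carry may spill into
    // the exponent, which correctly yields infinity for values >= 65520.
    const uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) {
        h++;
    }
    return (uint16_t) (sign | h);
}

// Hypothetical helper: convert a whole row, as one would before handing the
// data to a kernel that only accepts FP16 input.
static std::vector<uint16_t> convert_row_f32_to_f16(const std::vector<float> & src) {
    std::vector<uint16_t> dst(src.size());
    for (size_t i = 0; i < src.size(); ++i) {
        dst[i] = f32_to_f16_bits(src[i]);
    }
    return dst;
}
```

In this sketch `0.1f` encodes to `0x2E66` and any magnitude of 65520 or more becomes infinity, so the extra pass costs both time and range compared to keeping the data in FP32.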

CISC · 2025-10-14T12:59:12Z

Slightly off-topic, but would it be doable to add BF16 support for applicable hardware?

JohannesGaessler · 2025-10-14T14:24:30Z

If we're just talking about bare-minimum support like what was done in this PR, the answer is yes. If you mean support with actually good performance, the answer is that it depends on the hardware. For NVIDIA you get BF16 tensor cores with Ampere or newer, but for hardware-accelerated basic BF16 operations like addition or multiplication you need Hopper or newer (though I would expect that internally those are mapped to FP32 instructions). In my opinion it would not be worth the effort to implement support in the kernels themselves.
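
As a rough illustration of why bare-minimum BF16 support is cheap: BF16 is the upper 16 bits of an FP32 value, so widening it is a bit shift and narrowing it is a rounding step. The sketch below is plain C++ for the CPU, not code from llama.cpp, and the function names `bf16_bits_to_f32` and `f32_to_bf16_bits` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen BF16 bits to FP32. This is exact because BF16
// shares FP32's sign and 8-bit exponent and only drops the low 16 mantissa bits.
static float bf16_bits_to_f32(uint16_t b) {
    const uint32_t x = (uint32_t) b << 16;
    float f;
    std::memcpy(&f, &x, sizeof(f));
    return f;
}

// Hypothetical helper: narrow FP32 to BF16 bits, rounding to nearest even.
static uint16_t f32_to_bf16_bits(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    if ((x & 0x7FFFFFFFu) > 0x7F800000u) {
        // NaN: keep the sign and force a quiet NaN so rounding cannot turn it into infinity.
        return (uint16_t) ((x >> 16) | 0x0040u);
    }
    // Add 0x7FFF plus the lowest kept bit, so exact ties round to the even neighbour.
    x += 0x7FFFu + ((x >> 16) & 1u);
    return (uint16_t) (x >> 16);
}
```

For example, `f32_to_bf16_bits(3.14159f)` gives `0x4049`, which widens back to `3.140625f`. A conversion-based approach like the one in this PR would only need the widening direction before running the existing kernels.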

CUDA: enable FA for FP32 KV cache

31f2d45

JohannesGaessler mentioned this pull request Oct 12, 2025

metal : FA support F32 K and V #16531

Merged

github-actions bot added the Nvidia GPU and ggml labels Oct 12, 2025

ggerganov approved these changes Oct 14, 2025

JohannesGaessler merged commit 9c7185d into ggml-org:master Oct 14, 2025
70 checks passed

yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025

CUDA: enable FA for FP32 KV cache (ggml-org#16546)

4f54d10

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 16, 2025

Revert "CUDA: enable FA for FP32 KV cache (ggml-org#16546)"

7ab7689

This reverts commit 9c7185d.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 16, 2025

Revert "CUDA: enable FA for FP32 KV cache (ggml-org#16546)"

459e687

This reverts commit 9c7185d.
